Statistical Language Modeling for Historical Documents using Weighted Finite-State Transducers and Long Short-Term Memory

نویسنده

  • Mayce Ibrahim Ali Al Azawi
چکیده

The goal of this work is to develop statistical natural language models and processing techniques based on Recurrent Neural Networks (RNN), especially the recently introduced Long ShortTerm Memory (LSTM). Due to their adapting and predicting abilities, these methods are more robust, and easier to train than traditional methods, i.e., words list and rule-based models. They improve the output of recognition systems and make them more accessible to users for browsing and reading. These techniques are required, especially for historical books which might take years of effort and huge costs to manually transcribe them. The contributions of this thesis are several new methods which have high-performance computing and accuracy. First, an error model for improving recognition results is designed. As a second contribution, a hyphenation model for difficult transcription for alignment purposes is suggested. Third, a dehyphenation model is used to classify the hyphens in noisy transcription. The fourth contribution is using LSTM networks for normalizing historical orthography. A size normalization alignment is implemented to equal the size of strings, before the training phase. Using the LSTM networks as a language model to improve the recognition results is the fifth contribution. Finally, the sixth contribution is a combination of Weighted Finite-State Transducers (WFSTs), and LSTM applied on multiple recognition systems. These contributions will be elaborated in more detail. Context-dependent confusion rules is a new technique to build an error model for Optical Character Recognition (OCR) corrections. The rules are extracted from the OCR confusions which appear in the recognition outputs and are translated into edit operations, e.g., insertions, deletions, and substitutions using the Levenshtein edit distance algorithm. The edit operations are extracted in a form of rules with respect to the context of the incorrect string to build an error model using WFSTs. The context-dependent rules assist the language model to find the best candidate corrections. They avoid the calculations that occur in searching the language model and they also make the language model able to correct incorrect words by using contextdependent confusion rules. The context-dependent error model is applied on the university of Washington (UWIII) dataset and the Nastaleeq script in Urdu dataset. It improves the OCR results from an error rate of 1.14% to an error rate of 0.68%. It performs better than the state-of-the-art single rule-based which returns an error rate of 1.0%. This thesis describes a new, simple, fast, and accurate system for generating correspondences

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB) as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIBchr('39')s archive is one of the richest archives in Iran containing a huge amount of multimedia data. Monitoring this massive volume of data, and brows and retrieval of this archive is one of the key issues for this broadcasting...

متن کامل

Prediction of Covid-19 Prevalence and Fatality Rates in Iran Using Long Short-Term Memory Neural Network

Introduction: The rapid spread of COVID-19 has become a critical threat to the world. So far, millions of people worldwide have been infected with the disease. The Covid-19 pandemic has had significant effects on various aspects of human life. Currently, prediction of the virus's spread is essential in order to be safe and make necessary arrangements. It can help control the rate of its outbrea...

متن کامل

Prediction of Covid-19 Prevalence and Fatality Rates in Iran Using Long Short-Term Memory Neural Network

Introduction: The rapid spread of COVID-19 has become a critical threat to the world. So far, millions of people worldwide have been infected with the disease. The Covid-19 pandemic has had significant effects on various aspects of human life. Currently, prediction of the virus's spread is essential in order to be safe and make necessary arrangements. It can help control the rate of its outbrea...

متن کامل

Investigations on search methods for speech recognition using weighted finite-state transducers

The search problem in the statistical approach to speech recognition is to find the most likely word sequence for an observed speech signal using a combination of knowledge sources, i.e. the language model, the pronunciation model, and the acoustic models of phones. The resulting search space is enormous. Therefore, an efficient search strategy is required to compute the result with a feasible ...

متن کامل

Modeling Gasoline Consumption Behaviors in Iran Based on Long Memory and Regime Change

In this study, for the first time, we model gasoline consumption behavior in Iran using the long-term memory model of the autoregressive fractionally integrated moving average and non-linear Markov-Switching regime change model. Initially, the long-term memory feature of the ARFIMA model is investigated using the data from 1927 to 2017. The results indicate that the time series studied has a lo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015